Getting Meta with Big Data Malaysia

Scraping the Big Data Malaysia Facebook group for fun. Profit unlikely.

Hello World

This is an introductory-level notebook demonstrating how to deal with a small, but meaty dataset. Things we will do here include:

  • Loading a JSON dataset.
  • Dealing with a minor data quality issue.
  • Handling timestamps.
  • Dataset slicing and dicing.
  • Plotting histograms.

A "follow the data" approach will be taken. This notebook may appear quite long, but a good portion of the length is pretty-printed raw data that no one is expected to read in its entirety. It is there to skim, to get an idea of the structure of our data.

Get all the data

This notebook assumes you have already prepared a flattened JSON file named all_the_data.json, which you would have done by:

  • Writing your oauth token into oauth_file according to the instructions in pull_feed.py.
  • Running python pull_feed.py to pull down the feed pages into the BigDataMyData directory.
  • Running python flatten_saved_data.py > all_the_data.json.

In [1]:
# we need this for later:
%matplotlib inline

import json
INPUT_FILE = "all_the_data.json"
with open(INPUT_FILE, "r") as big_data_fd:
    big_data = json.load(big_data_fd)

Is it big enough?

Now we have all our data loaded into variable big_data, but can we really say it's Big Data?


In [2]:
print "We have {} posts".format(len(big_data))


We have 1946 posts

Wow! So data! Very big!

Seriously though... it's not big. In fact it's rather small. How small is small? Here's a clue...


In [3]:
import os
print "The source file is {} bytes. Pathetic.".format(os.stat(INPUT_FILE).st_size)


The source file is 3773450 bytes. Pathetic.

At the time this was written, the file was just under 4MB, and there were fewer than 2k posts. Note that the post count excludes comments made on posts, but still, this stuff is small. It is small enough that at no point do we need to do anything clever from a data indexing/caching/storage perspective, so to start we will take the simplistic but often appropriate approach of slicing and dicing our big_data object directly. Later on we'll get into pandas DataFrame objects.
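To put the byte count in more familiar units, here is a minimal sketch of the arithmetic. The helper name describe_size is hypothetical (it is not part of the notebook), and the only input is the byte count printed above.

```python
def describe_size(num_bytes):
    """Return the size as (decimal megabytes, binary mebibytes)."""
    megabytes = num_bytes / 1e6            # 1 MB  = 1,000,000 bytes
    mebibytes = num_bytes / float(2 ** 20)  # 1 MiB = 1,048,576 bytes
    return megabytes, mebibytes


# 3773450 is the size reported by os.stat() in the cell above.
mb, mib = describe_size(3773450)
print("{:.2f} MB, {:.2f} MiB".format(mb, mib))  # prints "3.77 MB, 3.60 MiB"
```

Either way you count it, the whole dataset fits comfortably in memory.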

Anyway, size doesn't matter. It's variety that counts.

Fields of gold

Now we know how many elements (rows I guess?) we have, but how much variety do we have in this data? One measure of this is the number of distinct field names that appear across those items:


In [4]:
import itertools
all_the_fields = set(itertools.chain.from_iterable(big_data))
print "We have {} different field names:".format(len(all_the_fields))
print all_the_fields


We have 30 different field names:
set([u'application', u'actions', u'likes', u'created_time', u'message', u'id', u'story', u'from', u'subscribed', u'privacy', u'comments', u'shares', u'to', u'story_tags', u'type', u'status_type', u'picture', u'description', u'object_id', u'link', u'properties', u'icon', u'name', u'message_tags', u'with_tags', u'updated_time', u'caption', u'place', u'source', u'is_hidden'])
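The set above says which field names exist, but not how often each one appears. A rough sketch of counting that with collections.Counter follows. The sample_posts list is a made-up stand-in for big_data, so the counts mean nothing for the real feed.

```python
from collections import Counter

# Hypothetical stand-in for big_data: a list of dicts with differing keys.
sample_posts = [
    {"id": "1", "message": "hello", "likes": {}},
    {"id": "2", "message": "world"},
    {"id": "3", "story": "someone shared a link"},
]

# Iterating over a dict yields its keys, so this counts posts per field name.
field_counts = Counter(field for post in sample_posts for field in post)

# Most frequent fields first.
for field, count in field_counts.most_common():
    print("{:10s} {}".format(field, count))
```

Swapping sample_posts for big_data should show which fields are present on nearly every post and which are rare.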

Are we missing anything? A good way to sanity check things is to actually inspect the data, so let's look at a random item:


In [5]:
import random
import pprint
# re-run this as much as you like to inspect different items
pprint.pprint(random.choice(big_data))


{u'actions': [{u'link': u'https://www.facebook.com/497068793653308/posts/961960940497422',
               u'name': u'Comment'},
              {u'link': u'https://www.facebook.com/497068793653308/posts/961960940497422',
               u'name': u'Like'},
              {u'link': u'/groups/bigdatamy/', u'name': u'Create Group Chat'}],
 u'application': {u'id': u'183319479511',
                  u'name': u'Hootsuite',
                  u'namespace': u'hootsuiteprod'},
 u'created_time': u'2014-10-21T02:15:17+0000',
 u'from': {u'id': u'10152418624011789', u'name': u'John F.X. Berns'},
 u'id': u'497068793653308_961960940497422',
 u'is_hidden': False,
 u'message': u'Hadoop World: The executive dashboard is on the way out - http://ow.ly/D3eF1',
 u'privacy': {u'allow': u'',
              u'deny': u'',
              u'description': u'',
              u'friends': u'',
              u'value': u''},
 u'to': {u'data': [{u'id': u'497068793653308',
                    u'name': u'Big Data Malaysia'}]},
 u'type': u'status',
 u'updated_time': u'2014-10-21T02:15:17+0000'}

From that you should be able to sense that we are missing some things: an item is not simply a flat set of fields, because some of those fields have data hierarchies beneath them. For example:


In [6]:
pprint.pprint(big_data[234])


{u'actions': [{u'link': u'https://www.facebook.com/497068793653308/posts/1032324310127751',
               u'name': u'Comment'},
              {u'link': u'https://www.facebook.com/497068793653308/posts/1032324310127751',
               u'name': u'Like'},
              {u'link': u'/groups/bigdatamy/', u'name': u'Create Group Chat'}],
 u'comments': [{u'data': [{u'can_remove': True,
                           u'created_time': u'2015-02-02T14:20:46+0000',
                           u'from': {u'id': u'10203864949854090',
                                     u'name': u'Teuku Faruq'},
                           u'id': u'1033140356712813',
                           u'like_count': 1,
                           u'message': u'Interesting startup, all the best!',
                           u'user_likes': False},
                          {u'can_remove': True,
                           u'created_time': u'2015-02-04T07:45:13+0000',
                           u'from': {u'id': u'10203477707997024',
                                     u'name': u'Syed Ahmad Fuqaha'},
                           u'id': u'1034073379952844',
                           u'like_count': 0,
                           u'message': u'Thank you Teuku Faruq!',
                           u'message_tags': [{u'id': u'10203864949854090',
                                              u'length': 11,
                                              u'name': u'Teuku Faruq',
                                              u'offset': 10,
                                              u'type': u'user'}],
                           u'user_likes': False}],
                u'paging': {u'cursors': {u'after': u'WTI5dGJXVnVkRjlqZFhKemIzSTZNVEF6TkRBM016TTNPVGsxTWpnME5Eb3hOREl6TURNMU9URXo=',
                                         u'before': u'WTI5dGJXVnVkRjlqZFhKemIzSTZNVEF6TXpFME1ETTFOamN4TWpneE16b3hOREl5T0RnMk9EUTI='}}}],
 u'created_time': u'2015-02-01T05:42:29+0000',
 u'from': {u'id': u'10203477707997024', u'name': u'Syed Ahmad Fuqaha'},
 u'id': u'497068793653308_1032324310127751',
 u'is_hidden': False,
 u'likes': {u'data': [{u'id': u'10202838533911870',
                       u'name': u'Firdaus Adib'},
                      {u'id': u'10203864949854090', u'name': u'Teuku Faruq'},
                      {u'id': u'10152208407631596',
                       u'name': u'Bok Cabradilla'},
                      {u'id': u'10152535806106552', u'name': u'Brian Ho'},
                      {u'id': u'10152539541773459', u'name': u'Mohd Naim'},
                      {u'id': u'10152629084603737', u'name': u'Tajul Azhar'},
                      {u'id': u'10152569565771111',
                       u'name': u'Daniel Walters'},
                      {u'id': u'10154115325260227',
                       u'name': u'AWoon Haw Brando'},
                      {u'id': u'10204389172318345',
                       u'name': u'Fairul Syarmil'},
                      {u'id': u'10152528794631844',
                       u'name': u'Sandra Hanchard'}],
            u'paging': {u'cursors': {u'after': u'MTAxNTI1Mjg3OTQ2MzE4NDQ=',
                                     u'before': u'MTAyMDI4Mzg1MzM5MTE4NzA='}}},
 u'message': u'Thank you for the approval. Im part of www.katsana.com, a local startup specializing in GPS tracking & fleet management system. Hope to be able to contribute to this group and learn from the masters.',
 u'privacy': {u'allow': u'',
              u'deny': u'',
              u'description': u'',
              u'friends': u'',
              u'value': u''},
 u'to': {u'data': [{u'id': u'497068793653308',
                    u'name': u'Big Data Malaysia'}]},
 u'type': u'status',
 u'updated_time': u'2015-02-04T07:45:13+0000'}
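One way to summarise nesting like this without reading the whole dump is to list the dotted key paths that occur in an item. The following is a minimal sketch under stated assumptions: key_paths is a hypothetical helper (not part of the notebook), and sample_post is a heavily trimmed, hand-copied stand-in for the item above rather than the real big_data[234].

```python
def key_paths(value, prefix=""):
    """Yield a dotted key path for every leaf value in nested dicts and lists."""
    if isinstance(value, dict):
        for key in sorted(value):
            child_prefix = prefix + "." + key if prefix else key
            for path in key_paths(value[key], child_prefix):
                yield path
    elif isinstance(value, list):
        # "[]" stands for any list position, so repeated items share one path.
        for item in value:
            for path in key_paths(item, prefix + "[]"):
                yield path
    else:
        yield prefix


# Trimmed stand-in for the post printed above (assumption: same shape).
sample_post = {
    "id": "497068793653308_1032324310127751",
    "from": {"id": "10203477707997024", "name": "Syed Ahmad Fuqaha"},
    "comments": [
        {"data": [{"message": "Interesting startup, all the best!",
                   "like_count": 1}]},
    ],
}

paths = sorted(set(key_paths(sample_post)))
for path in paths:
    print(path)
```

Note that empty dicts and lists produce no paths in this sketch, which is fine for a quick look at structure.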

From that we can see some fields have hierarchies within them. For example, likes holds a list of dictionaries which happen to be relatively trivial (names and ids... I wonder why Facebook didn't just post the id and make you look up the name?). The comments field is a bit more complex: it contains a list of dictionaries, and each comment inside can have fields that are dictionaries or lists of their own. For example, we can see that the second comment on that post tagged Teuku Faruq:


In [7]:
pprint.pprint(big_data[234]['comments'][0]['data'][1]['message_tags'])


[{u'id': u'10203864949854090',
  u'length': 11,
  u'name': u'Teuku Faruq',
  u'offset': 10,
  u'type': u'user'}]
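The offset and length fields appear to locate the tagged name inside the comment's message. Here is a small sketch that checks this for the tag above. The message and tag are copied by hand from the output, and treating the offset as a count of characters is an assumption, not something the data confirms.

```python
# Copied from the second comment on big_data[234] shown above.
message = u"Thank you Teuku Faruq!"
tag = {u"id": u"10203864949854090",
       u"length": 11,
       u"name": u"Teuku Faruq",
       u"offset": 10,
       u"type": u"user"}

# Assumption: offset and length are measured in characters of the message.
start = tag[u"offset"]
end = start + tag[u"length"]
tagged_text = message[start:end]

print(tagged_text)                 # prints "Teuku Faruq"
print(tagged_text == tag[u"name"])  # prints "True"
```

If that assumption holds across the dataset, the same slice could be used to strip or highlight tagged names in a message.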

Data quality annoyances

Actually I'm not even sure why the comments field is a single-entry list. Is that always the case?


In [8]:
set([len(data['comments']) for data in big_data if 'comments' in data])


Out[8]:
{1, 2}

Apparently that's not always the case: sometimes there are 2 items in the list. Let's see what that looks like...


In [9]:
multi_item_comment_lists = [data['comments'] for data in big_data if ('comments' in data) and (len(data['comments']) > 1)]
print len(multi_item_comment_lists)
pprint.pprint(multi_item_comment_lists[0])


4
[{u'data': [{u'can_remove': True,
             u'created_time': u'2015-02-27T03:39:29+0000',
             u'from': {u'id': u'10152465206977702', u'name': u'Peter Ho'},
             u'id': u'1049191648441017',
             u'like_count': 0,
             u'message': u'Peter the slide share has 404 message?',
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-27T03:43:23+0000',
             u'from': {u'id': u'10152075362431725',
                       u'name': u'Tirath Ramdas'},
             u'id': u'1049192758440906',
             u'like_count': 0,
             u'message': u'Works for me',
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-27T03:43:46+0000',
             u'from': {u'id': u'10152934839784580', u'name': u'Peter Kua'},
             u'id': u'1049192845107564',
             u'like_count': 0,
             u'message': u'works from side too Peter Ho',
             u'message_tags': [{u'id': u'10152465206977702',
                                u'length': 8,
                                u'name': u'Peter Ho',
                                u'offset': 20,
                                u'type': u'user'}],
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-27T03:44:16+0000',
             u'from': {u'id': u'10152465206977702', u'name': u'Peter Ho'},
             u'id': u'1049193048440877',
             u'like_count': 0,
             u'message': u'Must be me then',
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-27T03:47:59+0000',
             u'from': {u'id': u'10100974540589758',
                       u'name': u'Daniel Jean-Pierre Riveong'},
             u'id': u'1049194295107419',
             u'like_count': 1,
             u'message': u'Slideshare link doesn\'t work for me. It looks like so: "http://www.slideshare.net/p\u2026/big-data-week-kuala-lumpur-2015"',
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-27T03:48:21+0000',
             u'from': {u'id': u'10100974540589758',
                       u'name': u'Daniel Jean-Pierre Riveong'},
             u'id': u'1049194378440744',
             u'like_count': 1,
             u'message': u'Clicking on the image works though.',
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-27T03:55:29+0000',
             u'from': {u'id': u'10152415710124319',
                       u'name': u'Ng Swee Meng'},
             u'id': u'1049196715107177',
             u'like_count': 0,
             u'message': u'It feels different from the first one...',
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-27T04:06:48+0000',
             u'from': {u'id': u'10202993457636721',
                       u'name': u'Balaganesh Latchmanan'},
             u'id': u'1049199875106861',
             u'like_count': 0,
             u'message': u'Murali Shankar',
             u'message_tags': [{u'id': u'10152872458587148',
                                u'length': 14,
                                u'name': u'Murali Shankar',
                                u'offset': 0,
                                u'type': u'user'}],
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-27T04:19:25+0000',
             u'from': {u'id': u'10152075362431725',
                       u'name': u'Tirath Ramdas'},
             u'id': u'1049203205106528',
             u'like_count': 1,
             u'message': u"Ah, right, yes, the link in Peter's message somehow got mangled, but the link that Facebook extracted into the preview image does work :) so, just click on the image.",
             u'message_tags': [{u'id': u'10152934839784580',
                                u'length': 5,
                                u'name': u'Peter',
                                u'offset': 28,
                                u'type': u'user'}],
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-27T04:27:31+0000',
             u'from': {u'id': u'10152528794631844',
                       u'name': u'Sandra Hanchard'},
             u'id': u'1049206171772898',
             u'like_count': 1,
             u'message': u'Here is the link again: http://www.slideshare.net/petekua/big-data-week-kuala-lumpur-2015',
             u'user_likes': True},
            {u'can_remove': True,
             u'created_time': u'2015-02-27T04:36:10+0000',
             u'from': {u'id': u'10152322247874367',
                       u'name': u'Heislyc Loh'},
             u'id': u'1049208688439313',
             u'like_count': 0,
             u'message': u'Support!',
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-27T04:46:50+0000',
             u'from': {u'id': u'10152934839784580', u'name': u'Peter Kua'},
             u'id': u'1049211788439003',
             u'like_count': 1,
             u'message': u'I corrected the mangled link :p',
             u'user_likes': True},
            {u'can_remove': True,
             u'created_time': u'2015-02-27T05:01:58+0000',
             u'from': {u'id': u'10152934839784580', u'name': u'Peter Kua'},
             u'id': u'1049216228438559',
             u'like_count': 1,
             u'message': u'btw the 2-day expo is FOC',
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-27T06:39:24+0000',
             u'from': {u'id': u'10155151622655483',
                       u'name': u'Norhidayah Azman'},
             u'id': u'1049240855102763',
             u'like_count': 0,
             u'message': u'how do we send proposals for thu/fri? any example proposals we could use as a guide?',
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-27T09:26:37+0000',
             u'from': {u'id': u'10152934839784580', u'name': u'Peter Kua'},
             u'id': u'1049283101765205',
             u'like_count': 3,
             u'message': u"Norhidayah Azman we don't have a specific proposal template as it is free form. you can organize a bda workshop, demo, tutorial, hackathon, etc. let us know about it and we will work with you to make sure it gets maximim exposure and is a success.",
             u'message_tags': [{u'id': u'10155151622655483',
                                u'length': 16,
                                u'name': u'Norhidayah Azman',
                                u'offset': 0,
                                u'type': u'user'}],
             u'user_likes': True},
            {u'can_remove': True,
             u'created_time': u'2015-02-27T09:40:43+0000',
             u'from': {u'id': u'1102366916447454',
                       u'name': u'S S Mohd Fauzi'},
             u'id': u'1049286985098150',
             u'like_count': 0,
             u'message': u'Any opportunity for the researcher? i.e to discuss findings etc?',
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-27T16:28:55+0000',
             u'from': {u'id': u'10152075362431725',
                       u'name': u'Tirath Ramdas'},
             u'id': u'1049460655080783',
             u'like_count': 1,
             u'message': u"S S Mohd Fauzi last year UTAR organised a data mining workshop: http://www.utar.edu.my/econtent_sub.jsp?fcatid=16&fcontentid=10554. IMHO there is certainly room for an academic component to the week. I don't know if UTAR will be doing it again, but I hope some academic institution will do something.",
             u'message_tags': [{u'id': u'1102366916447454',
                                u'length': 14,
                                u'name': u'S S Mohd Fauzi',
                                u'offset': 0,
                                u'type': u'user'}],
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-27T16:48:26+0000',
             u'from': {u'id': u'10155151622655483',
                       u'name': u'Norhidayah Azman'},
             u'id': u'1049468408413341',
             u'like_count': 4,
             u'message': u"Funny u should say that Tirath :D\nI'm a lecturer at USIM, Nilai, and I was thinking exactly along those lines - adding an academic component to BDW :)\nI went to last year's BDW, and I'm keen to setup an event this year, but I'm a Big Data newbie myself! And we're a bit short on staff who can lead workshops and such. So I'm still mulling how best to proceed. I'm thinking a discussion panel of sorts. What would you guys like to see for an academic component to BDW?",
             u'message_tags': [{u'id': u'10152075362431725',
                                u'length': 6,
                                u'name': u'Tirath',
                                u'offset': 24,
                                u'type': u'user'}],
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-27T23:06:38+0000',
             u'from': {u'id': u'10152528794631844',
                       u'name': u'Sandra Hanchard'},
             u'id': u'1049615545065294',
             u'like_count': 1,
             u'message': u'There were at least 4 Unis hosting events last year (UTAR, Taylors, Sunway, MMU) if I recall correctly. Norhidayah Azman how about a panel debating ethics / privacy / surveillance concerns of big data? Or access to big data by humanities fields / digitial methods / skill gaps / complementing big data with small data / societal problems that can or cant be addressed? Depends on background of your department I guess :)',
             u'message_tags': [{u'id': u'10155151622655483',
                                u'length': 16,
                                u'name': u'Norhidayah Azman',
                                u'offset': 104,
                                u'type': u'user'}],
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-28T00:07:17+0000',
             u'from': {u'id': u'10152075362431725',
                       u'name': u'Tirath Ramdas'},
             u'id': u'1049636835063165',
             u'like_count': 3,
             u'message': u"I didn't attend the UTAR event, but from some feedback and their own report it is evident that the academics got an opportunity to present some of their students research findings (though I have no idea if they had a CFP or how exactly they picked speakers), so it was a good academic event, and Sandra is quite right, other academic institutions organised other things too, including Taylors. Taylors and UTAR both featured international speakers as well. MMU I believe hosted a hackathon, which they graciously opened to the public. There were so many things going on during BDW'14 that I might have forgotten some :)\n\nI think BDW would be a good time to organise such an academic event because you will get free promotional support, and also there will be some international guests who can attend your event. That said, I know the logistics are not trivial, so better start on it right now! In BDW'13 also there was interest in creating such a track, but it just could not come together due to the challenge of logistics.\n\nNorhidayah Azman if you are interested, I strongly urge you to chat with Peter Kua immediately to explore the idea - even if in the end tak jadi never mind, the important thing is to start very soon or there will be no way it can happen. By involving MDeC from the start at least they can reduce the odds of clashes for the time slot, and link you up with other interested parties.",
             u'message_tags': [{u'id': u'10155151622655483',
                                u'length': 16,
                                u'name': u'Norhidayah Azman',
                                u'offset': 1026,
                                u'type': u'user'},
                               {u'id': u'10152934839784580',
                                u'length': 9,
                                u'name': u'Peter Kua',
                                u'offset': 1099,
                                u'type': u'user'}],
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-28T00:32:58+0000',
             u'from': {u'id': u'10152934839784580', u'name': u'Peter Kua'},
             u'id': u'1049645301728985',
             u'like_count': 2,
             u'message': u'Norhidayah Azman, you can download the BDWKL2014 post-mortem report and read about the various events hosted by the 4 universities. The link: http://bigdataanalytics.my/downloads/BDW2014-Post-Event-Report.pdf. We will be more than happy to work with you on your idea.',
             u'message_tags': [{u'id': u'10155151622655483',
                                u'length': 16,
                                u'name': u'Norhidayah Azman',
                                u'offset': 0,
                                u'type': u'user'}],
             u'user_likes': True},
            {u'can_remove': True,
             u'created_time': u'2015-02-28T00:37:21+0000',
             u'from': {u'id': u'1102366916447454',
                       u'name': u'S S Mohd Fauzi'},
             u'id': u'1049646511728864',
             u'like_count': 3,
             u'message': u'Yeah, why not we organize BD workshop and do CFP (it is some kind of mini seminar to present findings related to big data). It is a good start to engage academicians with industry...',
             u'user_likes': True},
            {u'can_remove': True,
             u'created_time': u'2015-02-28T00:37:21+0000',
             u'from': {u'id': u'10152934839784580', u'name': u'Peter Kua'},
             u'id': u'1049646515062197',
             u'like_count': 1,
             u'message': u'Tang Wern Tien',
             u'message_tags': [{u'id': u'10153393479107259',
                                u'length': 14,
                                u'name': u'Tang Wern Tien',
                                u'offset': 0,
                                u'type': u'user'}],
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-28T00:45:50+0000',
             u'from': {u'id': u'10152075362431725',
                       u'name': u'Tirath Ramdas'},
             u'id': u'1049649165061932',
             u'like_count': 0,
             u'message': u"S S Mohd Fauzi that would be great, but I am mindful of the fact that there's not much time left, so maybe the CFP could be limited to abstracts for posters from students, with invited speakers? Or some other configuration...",
             u'message_tags': [{u'id': u'1102366916447454',
                                u'length': 14,
                                u'name': u'S S Mohd Fauzi',
                                u'offset': 0,
                                u'type': u'user'}],
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-28T00:51:42+0000',
             u'from': {u'id': u'1102366916447454',
                       u'name': u'S S Mohd Fauzi'},
             u'id': u'1049652655061583',
             u'like_count': 2,
             u'message': u'Yes, considering the time left, CFP for abstract, posters would be great. But, where to start? I can come out with CFP.',
             u'user_likes': True}],
  u'paging': {u'cursors': {u'after': u'WTI5dGJXVnVkRjlqZFhKemIzSTZNVEEwT1RZMU1qWTFOVEEyTVRVNE16b3hOREkxTURnME56QXk=',
                           u'before': u'WTI5dGJXVnVkRjlqZFhKemIzSTZNVEEwT1RFNU1UWTBPRFEwTVRBeE56b3hOREkxTURBNE16WTU='},
              u'next': u'https://graph.facebook.com/v2.0/497068793653308_1049188861774629/comments?access_token=CAACEdEose0cBAAe7p7jqgiBC2WSxdVY24YgpnNob6a7fPEMP16LZBP2JUP99n7xqpu3C84g4X6pJ932ZA6JYwtHES6DfhKxKexqUkhdIYpU0ocN0Wozah4mzdtlTNhjwGj6xcPobZCvQbvLSIERbKFFtg2NpLyF6VCCe75U5oZBVsjzxxZAKC1c0CCwvGkGJUZAbKz6VzS3jnqckrwUfg7dGzQLnXQzU4ZD%0A&limit=25&after=WTI5dGJXVnVkRjlqZFhKemIzSTZNVEEwT1RZMU1qWTFOVEEyTVRVNE16b3hOREkxTURnME56QXk%3D'}},
 {u'data': [{u'can_remove': True,
             u'created_time': u'2015-02-28T01:16:19+0000',
             u'from': {u'id': u'10152322247874367',
                       u'name': u'Heislyc Loh'},
             u'id': u'1049659495060899',
             u'like_count': 0,
             u'message': u"What's CFP?",
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-28T01:17:37+0000',
             u'from': {u'id': u'1102366916447454',
                       u'name': u'S S Mohd Fauzi'},
             u'id': u'1049660125060836',
             u'like_count': 1,
             u'message': u'Its is call for paper (CFP)',
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-28T01:27:27+0000',
             u'from': {u'id': u'10152322247874367',
                       u'name': u'Heislyc Loh'},
             u'id': u'1049663078393874',
             u'like_count': 2,
             u'message': u'[yet to be concrete] I\'d like to propose a pre-hackathon workshop, to work with data scientist and practitioners, data source providers and custodian (public & private), to work towards crunching / capturing / collecting a sample data sets, to prep. upcoming hackathon, e.g. AngelHack (June), BDA (MDeC-Q4) etc. Don\'t have a framework yet, would like to invite inputs.\n\nRational:\n1) You can\'t just go straight into hackathon without much prep works for Big Data project, previous outcome doesn\'t seems impactful enough to me\n\n2) It is a common challenge to look for relevant data sets for the purpose of hackathon\n\n3) Hackathon is essentially a project-based leaning activity\n\n4) I think this is useful to attract more data-driven developer and business manager interest\n\nIn a nutshell, we could see this as a "dataset prep. workshop"',
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-28T01:36:11+0000',
             u'from': {u'id': u'10152528794631844',
                       u'name': u'Sandra Hanchard'},
             u'id': u'1049665565060292',
             u'like_count': 2,
             u'message': u"Great idea Heislyc Loh In any quant analysis, a significant amount of work always goes into data preparation & it's important for quality of outputs/insights. If the workshop included showcasing a number of tools that can expedite data preparation; as well as discussion of sensitivities of using third-party data sources - that would be great contribution.",
             u'message_tags': [{u'id': u'10152322247874367',
                                u'length': 11,
                                u'name': u'Heislyc Loh',
                                u'offset': 11,
                                u'type': u'user'}],
             u'user_likes': True},
            {u'can_remove': True,
             u'created_time': u'2015-02-28T02:41:23+0000',
             u'from': {u'id': u'10152075362431725',
                       u'name': u'Tirath Ramdas'},
             u'id': u'1049687728391409',
             u'like_count': 1,
             u'message': u'S S Mohd Fauzi I think the starting point should be to figure out logistics options, in particular what suitable venues might be available when. That will be sufficient information for the initial CFP (e.g. if venue options are all in KL you can say "venue TBD but within Kuala Lumpur", and lock in a specific date as well), later you can update the CFP with the specific location and time, and then work out invited speaker spots, minimal F&B, and sponsorship requirements... best to discuss with MDeC and other who have expressed interest above!\n\nHeislyc Loh the pre-hackathon workshop idea is definitely a good one, as you say it\'s an opportunity to pick up skills but also to form teams. It is something we tried to do in BDW\'13, but unfortunately that year we decided to abort the hackathon idea because there was too much uncertainty around the date of the General Election (it looked like the GE could fall on the weekend we were planning to have the hackathon... turned out to be the week later, but we didn\'t know that in time). Despite dropping the hackathon, the workshop still proceeded, thanks to Ng Swee Meng.\n\nAll good ideas everyone, you should push through with them! I wish I could help out more, but alas this year I am squarely in the NATO* category because I will most likely be overseas for the entire period. (*No Action, Talk Only :P)',
             u'message_tags': [{u'id': u'1102366916447454',
                                u'length': 14,
                                u'name': u'S S Mohd Fauzi',
                                u'offset': 0,
                                u'type': u'user'},
                               {u'id': u'10152322247874367',
                                u'length': 11,
                                u'name': u'Heislyc Loh',
                                u'offset': 549,
                                u'type': u'user'},
                               {u'id': u'10152415710124319',
                                u'length': 12,
                                u'name': u'Ng Swee Meng',
                                u'offset': 1110,
                                u'type': u'user'}],
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-28T02:57:01+0000',
             u'from': {u'id': u'10155151622655483',
                       u'name': u'Norhidayah Azman'},
             u'id': u'1049696108390571',
             u'like_count': 3,
             u'message': u"My department does Information Security and Assurance, so we can gladly organize a panel on security/ethics/privacy/surveillance :D\nI'm aware that there were other security tracks at BDW14, will there be any clashes this year?\n\nAbt CFPs, I'm open to inviting researchers to present their work - S S Mohd Fauzi I could ask my bosses if we could host your CFP. But it's unlikely the papers will be peer reviewed, let alone get published/indexed. Will this be ok?",
             u'message_tags': [{u'id': u'1102366916447454',
                                u'length': 14,
                                u'name': u'S S Mohd Fauzi',
                                u'offset': 295,
                                u'type': u'user'}],
             u'user_likes': True},
            {u'can_remove': True,
             u'created_time': u'2015-02-28T03:00:39+0000',
             u'from': {u'id': u'1102366916447454',
                       u'name': u'S S Mohd Fauzi'},
             u'id': u'1049699815056867',
             u'like_count': 0,
             u'message': u'That would be great a great start...',
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-02-28T03:01:12+0000',
             u'from': {u'id': u'1102366916447454',
                       u'name': u'S S Mohd Fauzi'},
             u'id': u'1049701061723409',
             u'like_count': 1,
             u'message': u'Norhidayah Azman',
             u'message_tags': [{u'id': u'10155151622655483',
                                u'length': 16,
                                u'name': u'Norhidayah Azman',
                                u'offset': 0,
                                u'type': u'user'}],
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-03-02T06:42:36+0000',
             u'from': {u'id': u'10152075362431725',
                       u'name': u'Tirath Ramdas'},
             u'id': u'1050910038269178',
             u'like_count': 0,
             u'message': u"Perhaps you could do a basic peer review just to ensure relevance, but maybe save novelty+impact scoring for next time. Since reviews can be done remotely I'd be happy to assist with that if it would help (I used to review CS papers years ago - rusty now, but hopefully a bit like riding a bike).",
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-03-02T11:37:18+0000',
             u'from': {u'id': u'10152184193128773',
                       u'name': u'Norashikin Abdul Hamid'},
             u'id': u'1051069438253238',
             u'like_count': 1,
             u'message': u"what about free opportunity for newbies/startups to try out all the different tools related to big data and vendors providing support? \nHi Peter Kua! Didn't get to catch up with you further on this topic :)",
             u'message_tags': [{u'id': u'10152934839784580',
                                u'length': 9,
                                u'name': u'Peter Kua',
                                u'offset': 139,
                                u'type': u'user'}],
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-03-03T14:40:26+0000',
             u'from': {u'id': u'10155151622655483',
                       u'name': u'Norhidayah Azman'},
             u'id': u'1051719904854858',
             u'like_count': 1,
             u'message': u"Sorry guys but it's a no go on my side - the date's too soon and there's too much paperwork to get everything done in time! :(\nHowever, I'm very happy to pitch in and collaborate if anybody needs a hand with their events!",
             u'user_likes': True},
            {u'can_remove': True,
             u'created_time': u'2015-03-04T07:19:11+0000',
             u'from': {u'id': u'10152075362431725',
                       u'name': u'Tirath Ramdas'},
             u'id': u'1052149544811894',
             u'like_count': 0,
             u'message': u"Thanks for trying anyway Norhidayah Azman. I hope you and others are not too discouraged - although you are right that it will be very hard to organize in time for BDW'15, perhaps it could be organized as a standalone event slated for a few months after BDW'15? Say a small workshop with a minimal CFP component, then with more forward planning a full conference to coincide with BDW'16?",
             u'message_tags': [{u'id': u'10155151622655483',
                                u'length': 16,
                                u'name': u'Norhidayah Azman',
                                u'offset': 25,
                                u'type': u'user'}],
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-03-04T12:14:17+0000',
             u'from': {u'id': u'10155151622655483',
                       u'name': u'Norhidayah Azman'},
             u'id': u'1052237138136468',
             u'like_count': 2,
             u'message': u"Sounds good :) Do we already hv the dates for BDW'16? Gonna need to write up a kertas kerja asap :P",
             u'user_likes': True},
            {u'can_remove': True,
             u'created_time': u'2015-03-05T12:23:20+0000',
             u'from': {u'id': u'10153393479107259',
                       u'name': u'Tang Wern Tien'},
             u'id': u'1052813858078796',
             u'like_count': 4,
             u'message': u"Hi guys, If you are interested to explore on the prospect of hosting a partner event in this year's BDW, please drop me an email asap with your contact details and we can discuss from there. My email is werntien@mdec.com.my.",
             u'user_likes': False},
            {u'can_remove': True,
             u'created_time': u'2015-03-06T03:45:25+0000',
             u'from': {u'id': u'10152934839784580', u'name': u'Peter Kua'},
             u'id': u'1053171694709679',
             u'like_count': 1,
             u'message': u'Norhidayah Azman, no concrete dates for BDW16 yet, but it is usually held end Q1 or beginning Q2 :)',
             u'message_tags': [{u'id': u'10155151622655483',
                                u'length': 16,
                                u'name': u'Norhidayah Azman',
                                u'offset': 0,
                                u'type': u'user'}],
             u'user_likes': False}],
  u'paging': {u'cursors': {u'after': u'WTI5dGJXVnVkRjlqZFhKemIzSTZNVEExTXpFM01UWTVORGN3T1RZM09Ub3hOREkxTmpFek5USTE=',
                           u'before': u'WTI5dGJXVnVkRjlqZFhKemIzSTZNVEEwT1RZMU9UUTVOVEEyTURnNU9Ub3hOREkxTURnMk1UYzU='},
              u'previous': u'https://graph.facebook.com/v2.0/497068793653308_1049188861774629/comments?limit=25&access_token=CAACEdEose0cBAAe7p7jqgiBC2WSxdVY24YgpnNob6a7fPEMP16LZBP2JUP99n7xqpu3C84g4X6pJ932ZA6JYwtHES6DfhKxKexqUkhdIYpU0ocN0Wozah4mzdtlTNhjwGj6xcPobZCvQbvLSIERbKFFtg2NpLyF6VCCe75U5oZBVsjzxxZAKC1c0CCwvGkGJUZAbKz6VzS3jnqckrwUfg7dGzQLnXQzU4ZD%0A&before=WTI5dGJXVnVkRjlqZFhKemIzSTZNVEEwT1RZMU9UUTVOVEEyTURnNU9Ub3hOREkxTURnMk1UYzU%3D'}}]

Skimming the above it looks as though very long comment threads are split into multiple "pages" in the comments list. This may be an artifact of the paging code in pull_feed.py, which is not ideal. At some point we may fix it there, but for the time being we'll just consider it a data quality inconvenience that we will have to deal with.

Here's a function to work around this annoyance:


In [10]:
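def flatten_comments_pages(comments_pages):
    # comments_pages is a post's 'comments' value: a list of pages,
    # each page holding its comments under the 'data' key
    flattened_comments = []
    for page in comments_pages:
        flattened_comments += page['data']
    return flattened_comments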

post_comments_paged = multi_item_comment_lists[0]
print "Post has {} comments".format(len(flatten_comments_pages(post_comments_paged)))


Post has 40 comments
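
To make the shape of the problem concrete, here is a minimal sketch of the same flattening idea applied to a made-up thread. The toy_thread data is invented for illustration (only the 'data' key mirrors the real dump), flatten_pages is a hypothetical name, and itertools.chain is swapped in as an alternative to the += loop. print is called with a single argument so the snippet should run under Python 2 or 3.

```python
import itertools

# Hypothetical two-page thread; real pages also carry a 'paging' key.
toy_thread = [
    {'data': [{'id': 'c1'}, {'id': 'c2'}]},
    {'data': [{'id': 'c3'}]},
]

def flatten_pages(pages):
    # Concatenate the 'data' list of every page into one flat list.
    return list(itertools.chain.from_iterable(page['data'] for page in pages))

toy_comments = flatten_pages(toy_thread)
print("Toy thread has {} comments".format(len(toy_comments)))
```

Either form preserves the order of comments across pages, which matters if the thread is later read chronologically.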

Start plotting things already dammit

Now that we're counting comments, it's natural to ask: what does the number-of-comments-per-post distribution look like?

IMPORTANT NOTE: Beyond this point, we start to "follow the data" as we analyse things, and we do so in a time-relative way (e.g. comparing the last N days of posts to historical data). As Big Data Malaysia is a living breathing group, the data set is a living breathing thing, so things may change, and the conclusions informing the analysis here may suffer logic rot.


In [11]:
comments_threads = [data['comments'] for data in big_data if 'comments' in data]
count_of_posts_with_no_comments = len(big_data) - len(comments_threads)
comments_counts = [0] * count_of_posts_with_no_comments
comments_counts += [len(flatten_comments_pages(thread)) for thread in comments_threads]

import matplotlib.pyplot as plt
plt.hist(comments_counts, bins=max(comments_counts))
plt.title("Comments-per-post Histogram")
plt.xlabel("Comments per post")
plt.ylabel("Frequency")
plt.show()


This sort of adds up intuitively; posts with long comment threads will be rare, though from experience with this forum it does not seem right to conclude that there is a lot of posting going on with no interaction... the community is a bit more engaged than that.

But since this is Facebook, comments aren't the only way of interacting with a post. There's also the wonderful 'Like'.


In [12]:
likes_threads = [data['likes']['data'] for data in big_data if 'likes' in data]
count_of_posts_with_no_likes = len(big_data) - len(likes_threads)
likes_counts = [0] * count_of_posts_with_no_likes
likes_counts += [len(thread) for thread in likes_threads]

plt.hist(likes_counts, bins=max(likes_counts))
plt.title("Likes-per-post Histogram")
plt.xlabel("Likes per post")
plt.ylabel("Frequency")
plt.show()


Note that the above does not include Likes on Comments made on posts; only Likes made on posts themselves are counted.

While this paints the picture of a more engaged community, it still doesn't feel quite right. It seems unusual these days to see a post go by without a Like or two.

I have a hunch that the zero-like posts are skewed a bit to the earlier days of the group. To dig into that we'll need to start playing with timestamps. Personally I prefer to deal with time as UTC epoch seconds, and surprisingly it seems I need to write my own helper function for this.


In [13]:
import datetime
import dateutil.parser  # the parser submodule must be imported explicitly
import pytz

def epoch_utc_s(date_string):
    dt_local = dateutil.parser.parse(str(date_string))
    dt_utc = dt_local.astimezone(pytz.utc)
    nineteenseventy = datetime.datetime(1970,1,1)
    epoch_utc = dt_utc.replace(tzinfo=None) - nineteenseventy
    return int(epoch_utc.total_seconds())

posts_without_likes = [data for data in big_data if 'likes' not in data]
posts_with_likes = [data for data in big_data if 'likes' in data]
timestamps_of_posts_without_likes = [epoch_utc_s(post['created_time']) for post in posts_without_likes]
timestamps_of_posts_with_likes = [epoch_utc_s(post['created_time']) for post in posts_with_likes]

import numpy
median_epoch_liked = int(numpy.median(timestamps_of_posts_with_likes))
median_epoch_non_liked = int(numpy.median(timestamps_of_posts_without_likes))
# note: fromtimestamp() renders the epoch in this machine's local timezone, not UTC
print "Median timestamp of posts without likes: {} ({})".format(datetime.datetime.fromtimestamp(median_epoch_non_liked),
                                                                median_epoch_non_liked)
print "Median timestamp of posts with likes: {} ({})".format(datetime.datetime.fromtimestamp(median_epoch_liked),
                                                             median_epoch_liked)


Median timestamp of posts without likes: 2014-04-25 03:08:38 (1398359318)
Median timestamp of posts with likes: 2014-08-29 03:13:29 (1409246009)
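
As a cross-check on the timestamp handling, here is a minimal stdlib-only sketch of the same string-to-epoch conversion. It assumes every timestamp uses the "+0000" (UTC) suffix seen in the dump above, and epoch_utc_s_stdlib is a hypothetical name, not something from pull_feed.py.

```python
import calendar
import time

def epoch_utc_s_stdlib(date_string):
    # Assumes the "YYYY-MM-DDTHH:MM:SS+0000" form seen in the dump (always UTC).
    if not date_string.endswith("+0000"):
        raise ValueError("expected a +0000 (UTC) timestamp: {}".format(date_string))
    parsed = time.strptime(date_string[:-5], "%Y-%m-%dT%H:%M:%S")
    # timegm() interprets the struct_time as UTC, unlike time.mktime()
    return calendar.timegm(parsed)

print(epoch_utc_s_stdlib("2015-02-28T02:57:01+0000"))
```

This trades generality for fewer dependencies: any other UTC offset is rejected outright rather than converted.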

In general it seems my hunch may have been right, but it will be clearer if we plot it.


In [14]:
plt.hist(timestamps_of_posts_without_likes, alpha=0.5, label='non-Liked posts')
plt.hist(timestamps_of_posts_with_likes, alpha=0.5, label='Liked posts')
plt.title("Liked vs non-Liked posts")
plt.xlabel("Time (epoch UTC s)")
plt.ylabel("Count of posts")
plt.legend(loc='upper left')
plt.show()


This is looking pretty legit now. We can see that lately there's been a significant uptick in the number of posts, and an uptick in the ratio of posts that receive at least one Like.
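
To put a number on that ratio rather than eyeballing overlapping bars, one option is to bucket posts by calendar month and compute the fraction with at least one Like. The sketch below uses invented (epoch, has_likes) pairs; liked_ratio_by_month and toy_posts are hypothetical names for illustration only.

```python
import collections
import time

# Hypothetical (epoch_utc_s, has_likes) pairs standing in for real posts.
toy_posts = [
    (1388534400, False),  # 2014-01-01 UTC
    (1389000000, True),   # 2014-01-06 UTC
    (1406851200, True),   # 2014-08-01 UTC
    (1407000000, True),   # 2014-08-02 UTC
]

def liked_ratio_by_month(posts):
    totals = collections.defaultdict(int)
    liked = collections.defaultdict(int)
    for epoch_s, has_likes in posts:
        # gmtime() keeps the bucketing in UTC regardless of local timezone
        month = time.strftime("%Y-%m", time.gmtime(epoch_s))
        totals[month] += 1
        if has_likes:
            liked[month] += 1
    return dict((m, float(liked[m]) / totals[m]) for m in totals)

ratios = liked_ratio_by_month(toy_posts)
for month in sorted(ratios):
    print("{}: {:.2f}".format(month, ratios[month]))
```

With the real data, the pairs could be built from epoch_utc_s(post['created_time']) and 'likes' in post.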

As another sanity check, we can revisit the Likes-per-post Histogram, but only include recent posts. While we're at it we might as well do the same for the Comments-per-post Histogram.


In [15]:
def less_than_n_days_ago(date_string, n):
    query_date = epoch_utc_s(date_string)
    n_days_ago = epoch_utc_s(datetime.datetime.now(pytz.utc) - datetime.timedelta(days=n))
    return query_date > n_days_ago

# try changing this variable then re-running this cell...
days_ago = 30

# create a slice of our big_data containing only posts created within the last n days
recent_data = [data for data in big_data if less_than_n_days_ago(data['created_time'], days_ago)]

# plot the Likes-per-post Histogram for recent_data
recent_likes_threads = [data['likes']['data'] for data in recent_data if 'likes' in data]
recent_count_of_posts_with_no_likes = len(recent_data) - len(recent_likes_threads)
recent_likes_counts = [0] * recent_count_of_posts_with_no_likes
recent_likes_counts += [len(thread) for thread in recent_likes_threads]

plt.hist(recent_likes_counts, bins=max(recent_likes_counts))
plt.title("Likes-per-post Histogram (last {} days)".format(days_ago))
plt.xlabel("Likes per post")
plt.ylabel("Frequency")
plt.show()

# plot the Comments-per-post Histogram for recent_data
recent_comments_threads = [data['comments'] for data in recent_data if 'comments' in data]
recent_count_of_posts_with_no_comments = len(recent_data) - len(recent_comments_threads)
recent_comments_counts = [0] * recent_count_of_posts_with_no_comments
recent_comments_counts += [len(flatten_comments_pages(thread)) for thread in recent_comments_threads]

plt.hist(recent_comments_counts, bins=max(recent_comments_counts))
plt.title("Comments-per-post Histogram (last {} days)".format(days_ago))
plt.xlabel("Comments per post")
plt.ylabel("Frequency")
plt.show()
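
The zero-padding pattern used for these histograms (count the posts lacking a key, then prepend that many zeros) can also be folded into a single pass over the posts, which leaves less room for mixing up which list is being subtracted from which. This is only a sketch: counts_per_post is a hypothetical helper and toy_posts is invented data shaped like the real posts.

```python
def counts_per_post(posts, key, count_fn=len):
    # One count per post; posts that lack `key` contribute a 0.
    return [count_fn(post[key]) if key in post else 0 for post in posts]

# Hypothetical posts shaped like the real ones.
toy_posts = [
    {'likes': {'data': [{'id': 'u1'}, {'id': 'u2'}]}},
    {'comments': [{'data': [{'id': 'c1'}]}, {'data': [{'id': 'c2'}]}]},
    {},
]

toy_likes = counts_per_post(toy_posts, 'likes', lambda likes: len(likes['data']))
toy_comments = counts_per_post(
    toy_posts, 'comments', lambda pages: sum(len(page['data']) for page in pages))
print(toy_likes)
print(toy_comments)
```

The resulting lists are in post order rather than zeros-first, which makes no difference to a histogram.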


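At the time of writing, histogramming Comments-per-post and Likes-per-post for only the posts made within the last 30 days revealed some interesting things:

  • In this period there were very few posts that received 0 likes. But more interestingly...
  • The histogram showed 0 posts with 0 comments, which would mean that every post that got no likes still got at least 1 comment. Treat this one with suspicion until the zero-comment count has been checked: it has to be len(recent_data) minus the number of recent posts that have comments. Subtracting the all-time count instead gives a negative number, and in Python [0] multiplied by a negative number is just an empty list, so zero-comment posts would silently vanish from the plot.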

Clicking 'Like' is a pretty cheap/lazy form of engagement. It takes a bit more effort to write and send a comment. There is too little here to form any conclusions, but it does make one think about how we may measure engagement levels, and whether or not this bodes well as an early indicator that the Big Data Malaysia community is becoming more engaged.
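
One hypothetical way to start on that measurement question is a simple two-by-two tally of posts by whether they drew any Likes and whether they drew any comments. The sketch below runs on invented posts; engagement_table and toy_posts are illustrative names, and the field layout is assumed to match the dump shown earlier.

```python
import collections

def engagement_table(posts):
    # Tally posts by (has at least one Like, has at least one comment).
    table = collections.Counter()
    for post in posts:
        has_likes = len(post.get('likes', {}).get('data', [])) > 0
        n_comments = sum(len(page['data']) for page in post.get('comments', []))
        table[(has_likes, n_comments > 0)] += 1
    return table

# Hypothetical posts shaped like the real ones.
toy_posts = [
    {'likes': {'data': [{'id': 'u1'}]}, 'comments': [{'data': [{'id': 'c1'}]}]},
    {'likes': {'data': [{'id': 'u2'}]}},
    {'comments': [{'data': [{'id': 'c2'}]}, {'data': [{'id': 'c3'}]}]},
    {'comments': [{'data': [{'id': 'c4'}]}]},
    {},
]

table = engagement_table(toy_posts)
for liked in (True, False):
    for commented in (True, False):
        print("liked={} commented={}: {}".format(liked, commented, table[(liked, commented)]))
```

The (False, True) cell is the one the discussion above cares about: posts nobody Liked but somebody bothered to reply to.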

Coming soon...

There is much more that can be done, but let's pause for now to review what we've done up to here:

  • We've loaded a JSON dataset.
  • We've dealt with a bit of an annoying data quality issue (i.e. the unfortunate business with the sloppy comments field paging).
  • We've parsed timestamps from strings.
  • We've done some slicing and dicing of the dataset with list comprehensions and itertools to take slices of our data that matched some criteria (i.e. posts with comments, posts created within the last 30 days).
  • We've plotted some histograms.
  • We've just about started to scratch the surface with numpy and pandas - much more to come here.

The reward for this stuff is learning something new, and by the end of this exercise we have learned a little bit about the engagement habits of the Big Data Malaysia community. Soon we will pick up where this notebook has left off, and dig deeper into that issue.